TextFlow: A Text Similarity Measure based on Continuous Sequences

Authors

  • Yassine Mrabet
  • Halil Kilicoglu
  • Dina Demner-Fushman
Abstract

Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking, and recognition of paraphrases and textual entailment. While recent advances in deep learning have further highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-gram and skip-gram overlap, which rely on distinct slices of the input texts. In this paper we present a novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms. The new measure, called TextFlow, represents input text pairs as continuous curves and uses both the actual position of the words and sequence matching to compute the similarity value. Our experiments on eight different datasets show very encouraging results in paraphrase detection, textual entailment recognition and ranking relevance.
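
To make the baseline measures named in the abstract concrete, here is a minimal sketch of n-gram and skip-gram overlap. It is not the TextFlow measure, whose definition is not given on this page. The whitespace tokenization, the Jaccard ratio used as the overlap score, and the helper names `ngrams`, `skip_bigrams` and `jaccard` are all assumptions made for illustration.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return the set of contiguous n-token slices of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def skip_bigrams(tokens, max_skip):
    """Return ordered token pairs with at most max_skip tokens between them."""
    return {
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if j - i - 1 <= max_skip
    }


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed score); 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical example pair: same words, different order.
x = "the cat sat on the mat".split()
y = "the mat sat on the cat".split()

bigram_sim = jaccard(ngrams(x, 2), ngrams(y, 2))
skip_sim = jaccard(skip_bigrams(x, 1), skip_bigrams(y, 1))
print(bigram_sim, skip_sim)
```

Both scores compare sets of short slices, so the position of a slice within the sentence plays no role. This is the limitation the abstract points to when it says such measures rely on distinct slices of the input texts.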

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

A Geometric View of Similarity Measures in Data Mining

The main objective of data mining is to acquire information from a set of data for prospect applications using a measure. The concerning issue is that one often has to deal with large scale data. Several dimensionality reduction techniques like various feature extraction methods have been developed to resolve the issue. However, the geometric view of the applied measure, as an additional consid...

Prediction of user's trustworthiness in web-based social networks via text mining

In Social networks, users need a proper estimation of trust in others to be able to initialize reliable relationships. Some trust evaluation mechanisms have been offered, which use direct ratings to calculate or propagate trust values. However, in some web-based social networks where users only have binary relationships, there is no direct rating available. Therefore, a new method is required t...

A computational method to analyze the similarity of biological sequences under uncertainty

In this paper, we propose a new method to analyze the difference and similarity of biological sequences, based on the fuzzy sets theory. Considering the sequence order and some chemical and structural properties, we present a computational method to cluster the biological sequences. By some examples, we show that the new method is relatively easy and we are able to compare the sequences of arbi...

Intrusion detection using text processing techniques with a kernel based similarity measure

This paper focuses on intrusion detection based on system call sequences using text processing techniques. It introduces kernel based similarity measure for the detection of host-based intrusions. The k-nearest neighbour (kNN) classifier is used to classify a process as either normal or abnormal. The proposed technique is evaluated on the DARPA-1998 database and its performance is compared with...

Publication date: 2017